Reproducible workflows with targets

Pipeline tool for data science in R

Aud Halbritter

The problem?

The workflow of data science

Data analysis workflow with dependencies and multiple outcomes.

Updating the workflow

A change in one component requires updating part or all of the workflow.

The solution

The targets package

  • Pipeline tool for statistics and data science in R

  • Reproducible workflows avoiding repetition

  • Skips costly runtime for tasks that are already up to date

When is targets useful?

  • Code has long runtimes (slow or complex)

  • Interconnected tasks with dependencies

  • Different outputs (e.g. presentation and report)

How does it work?

File structure

  • Code

    • Functions
  • _targets.R file

  • Data

  • make.R file

  • Manuscript

_targets.R script file

The targets file configures and defines the pipeline.

  • load packages (targets and any others the pipeline needs)

  • source functions

library(targets)
tar_option_set(packages = c("readr", "dplyr", "ggplot2"))

source("R/functions.R")

The use_targets() function can set up the targets file.
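
The functions sourced from R/functions.R are ordinary R functions that take inputs and return a value. A minimal sketch of what one could look like; summarise_by_group() and the toy data are hypothetical and not part of the original project:

```r
# Hypothetical helper (not from the original project): mean of a numeric
# column per group, returned as a data frame.
summarise_by_group <- function(data, group, value) {
  means <- tapply(data[[value]], data[[group]], mean, na.rm = TRUE)
  data.frame(
    group = names(means),
    mean = as.numeric(means),
    stringsAsFactors = FALSE
  )
}

# Toy data, made up for this example
toy <- data.frame(
  site = c("a", "a", "b", "b"),
  height = c(1, 3, 10, 20)
)

toy_means <- summarise_by_group(toy, group = "site", value = "height")
print(toy_means)
```

Keeping each step in a function like this means a target's command can be a single function call.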

targets objects

  • list of target objects (data, result, figure)

  • create target objects with tar_target()

list(
  
  # Bootstrapping
  tar_target(
    name = trait_mean,
    command = make_bootstrapping(community, traits)
  ),
  # make figure
  tar_target(
    name = trait_mean_figure,
    command = make_trait_mean_figure(trait_results, trait_mean)
  ),
  
  ...
)

The output

list(

  ...

  # render manuscript (tar_render() comes from the tarchetypes package)
  tar_render(name = ms, path = "manuscript/manuscript.qmd")
)

Each target is a step of the analysis, and its value is stored in the _targets/objects/ folder.
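
Stored targets can be read back into an R session with tar_read(). A small sketch, assuming a toy pipeline: the targets raw and doubled are made up, and the script and store are placed in temporary locations so that no real project is touched:

```r
library(targets)

# Throwaway script and store (assumption: demo only, not the project layout)
script <- tempfile(fileext = ".R")
store <- tempfile()

# Toy pipeline with two made-up targets
tar_script(
  {
    library(targets)
    list(
      tar_target(name = raw, command = c(1, 2, 3, 4)),
      tar_target(name = doubled, command = raw * 2)
    )
  },
  ask = FALSE,
  script = script
)

tar_make(script = script, store = store)

# Each built target is saved as a file under <store>/objects/
stored_files <- list.files(file.path(store, "objects"))
print(stored_files)

# Read one stored value back into the session
doubled <- tar_read(doubled, store = store)
print(doubled)
```

In a real project with the default _targets.R and _targets/ locations, the script and store arguments can be left out, e.g. tar_read(trait_mean).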

manuscript.qmd file

---
title: "Plant functional trait responses to climate warming in an Arctic environment on Svalbard"
author: Aud Halbritter and PFTC4 consortium
format: html
bibliography: bibliography.bib
editor: 
  markdown: 
    wrap: sentence
---

Run the pipeline

Separate make.R file:

###############################
#### Make targets ####
###############################

library("targets")

# make the targets that are out of date
# looks for a file called "_targets.R" in the working directory
tar_make()
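
How tar_make() skips up-to-date targets can be seen with tar_outdated(), which lists the targets that would be rebuilt. A sketch under the same kind of assumptions: a made-up two-target pipeline kept in temporary locations:

```r
library(targets)

# Throwaway script and store (assumption: demo only)
script <- tempfile(fileext = ".R")
store <- tempfile()

tar_script(
  {
    library(targets)
    list(
      tar_target(name = raw, command = c(1, 2, 3, 4)),
      tar_target(name = doubled, command = raw * 2)
    )
  },
  ask = FALSE,
  script = script
)

# Nothing has been built yet, so every target should be listed
before <- tar_outdated(script = script, store = store)

tar_make(script = script, store = store)

# After a full run nothing should be listed
after <- tar_outdated(script = script, store = store)

# Change only the command of the downstream target
tar_script(
  {
    library(targets)
    list(
      tar_target(name = raw, command = c(1, 2, 3, 4)),
      tar_target(name = doubled, command = raw * 3)
    )
  },
  ask = FALSE,
  script = script
)

# Only the edited target should now be out of date
changed <- tar_outdated(script = script, store = store)

# This run rebuilds the edited target and skips the unchanged one
tar_make(script = script, store = store)
```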

Inspect the pipeline

Use tar_manifest(fields = all_of("command")) to list the command of each target and check for errors.

And tar_visnetwork() to visualise the dependency graph.
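
Both inspection steps can be tried on a toy pipeline before anything is built. In this sketch the targets are made up, the script lives in a temporary file, and tar_network() stands in for tar_visnetwork() because it returns the same dependency information as plain data frames:

```r
library(targets)

# Throwaway script and store (assumption: demo only)
script <- tempfile(fileext = ".R")
store <- tempfile()

tar_script(
  {
    library(targets)
    list(
      tar_target(name = raw, command = c(1, 2, 3, 4)),
      tar_target(name = doubled, command = raw * 2)
    )
  },
  ask = FALSE,
  script = script
)

# One row per target, with the command as text
manifest <- tar_manifest(
  fields = tidyselect::all_of("command"),
  script = script
)
print(manifest)

# Dependency graph as data: edges run from upstream to downstream targets
network <- tar_network(targets_only = TRUE, script = script, store = store)
print(network$edges)
```

tar_visnetwork() draws the same graph interactively and needs the visNetwork package to be installed.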

Start small

  • Start small and build on it.

  • Add small steps, one at a time; check each step before adding the next.

Exercise

Download this template repository.

  • get it running

  • make changes to it

library(usethis)

use_course("biostats-r/targets_workflow_svalbard")

Further reading / watching